Instruction Decode Stage

The instruction decode stage (D stage) 22 decodes an
instruction code inputted from the IF stage 21. Decoding is
performed once per clock cycle by using an instruction
decoder 75, such as a FHW (First Half Word) decoder, a
NFHW (Not First Half Word) decoder, an addressing mode
decoder or the like, of the instruction decode unit 40, and
zero to six bytes of instruction code are consumed in one
decode operation (no instruction code is consumed in
output processing or the like of the step code containing the
return address of the return subroutine instruction). An A
code, containing address calculation information, and a D
code, being an intermediate result of decoding the operation
code, are outputted to the A stage 23 by one decode
operation.

In the D stage 22, control of the PC calculation unit 42 for 
each instruction and output processing of an instruction code 
from the instruction queues 72 and 73 are also performed. 

In the D stage 22, preceding jump instruction processing
(D stage preceding jump) is performed for a jump instruc-
tion. The D code and the A code are not outputted in the case
of a jump instruction having performed a preceding jump,
except for a conditional branch instruction. Processing of the
instruction is completed in the D stage 22.

When a conditional branch instruction has been decoded
in the D stage 22, the IF stage 21 is directed to fetch
instructions from both the branch destination and the
non-branch destination. The instruction to be decoded fol-
lowing the conditional branch instruction is determined
according to the result of branch prediction. This means that
when the conditional branch instruction is predicted to cause
a branch, the instruction outputted from the instruction
queue A 72, which fetches the branch destination instruction,
is decoded; when the conditional branch instruction is
predicted not to branch, the instruction code outputted from
the instruction queue B 73, which fetches the non-branch
destination instruction, is decoded.

Operand Address Calculation Stage 

The operand address calculation stage (A stage) 23 is
divided roughly into two processings. One is a processing
performing latter-stage decode of an operation code by using
the decoder 76 of the instruction decode unit 40, and the
other is a processing performing calculation of an operand
address in the operand address calculation unit 41.

The latter-stage decode processing of the operation code
inputs the D code, and performs write reservation to a
register or a memory and output of an R code containing an
entry address of a microprogram routine, parameters for the
microprogram and the like. The write reservation to a
register or a memory prevents a wrong address calculation
that would result if the content of the register or the memory
referred to in address calculation were read before a
preceding instruction on the pipeline had rewritten it.

The operand address calculation processing inputs the A
code, performs address calculation of an operand in the
operand address calculation unit 41 according to the A code,
and outputs the result of the calculation as an F code. Also,
for a jump instruction, it performs calculation of the jump
destination address. It checks for a write reservation when
reading out a register used in address calculation, and when
it determines that a reservation is present because the
preceding instruction has not completed write processing to
the register or the memory, it waits until the preceding
instruction completes the write processing in the
E stage 27.
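
The write-reservation check described above can be illustrated with a minimal sketch. It is not the circuit of the embodiment: the class name, the method names and the register labels are hypothetical, and the reservation is modeled simply as a counter per register.

```python
class WriteReservations:
    """Hypothetical model of write reservations made during latter-stage decode."""

    def __init__(self):
        # register name -> number of writes still outstanding in the pipeline
        self._pending = {}

    def reserve(self, reg):
        # An instruction that will write `reg` registers its intent.
        self._pending[reg] = self._pending.get(reg, 0) + 1

    def release(self, reg):
        # Called when the E stage has completed the write to `reg`.
        remaining = self._pending.get(reg, 0) - 1
        if remaining > 0:
            self._pending[reg] = remaining
        else:
            self._pending.pop(reg, None)

    def must_wait(self, regs):
        # Address calculation stalls if any register it reads is reserved.
        return any(r in self._pending for r in regs)


res = WriteReservations()
res.reserve("R1")                              # preceding instruction will write R1
stalled_before = res.must_wait(["R1", "R2"])   # address calculation reads R1 and R2
res.release("R1")                              # E stage finishes the write
stalled_after = res.must_wait(["R1", "R2"])
```

In this sketch the address calculation is held while `stalled_before` is true and proceeds once the E stage has released the reservation.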

In the A stage 23, a preceding jump processing (A stage
preceding jump) is performed for a jump instruction which
has not performed a preceding jump in the D stage 22. For
a register indirect jump and a memory indirect jump, the A
stage preceding jump is performed. For an instruction which
has performed an A stage preceding jump, the R code and the
F code are not outputted, and the processing of the instruc-
tion is completed in the A stage 23.

Micro-ROM Access Stage

The operand fetch stage (F stage) 26 is also divided 
roughly into two processings. One is an access processing of 
the micro-ROM, particularly being called the R stage 24. 
The other is an operand prefetch processing, particularly 
being called the OF stage 25. The R stage 24 and OF stage 
25 are not always operated simultaneously, and their opera- 
tional timings differ depending on miss or hit of the data 
cache, miss or hit of the data TLB or the like. 

The micro-ROM access processing which occurs in the R
stage 24 is a micro-ROM access and a microinstruction
decode processing for producing, for the R code, an E code
which is an execute control code to be used for execution in
the next E stage 27.

Where processing for one R code is decomposed into two
or more microprogram steps, the IROM unit 43 and the
FROM unit 44 are used in the E stage 27, and the next R
code is sometimes put in a micro-ROM-access-wait state. A
micro-ROM access for the R code is performed when no
micro-ROM access in the E stage 27 is performed. In the
microprocessor, a number of integer operation instructions
are performed in one microprogram step, and a number of
floating point operation instructions are performed in two
microprogram steps. Therefore, micro-ROM accesses for
the R code are often performed one after another.

In the processing of the R stage 24, only the IROM unit
43 is accessed for an instruction not using the floating point
processing unit 46, and the FROM unit 44 is not accessed.
The IROM unit 43 and the FROM unit 44 are both accessed
for an instruction using the floating point processing unit 46
(floating point operation instruction, integer multiply/divide
instruction or the like).

Operand Fetch Stage 

The operand fetch stage (OF stage) 25 performs operand 
pre-fetch processing among the above-mentioned two pro- 
cessings to be performed in the F stage 26. 

In the operand fetch stage 25, the logical address of the F 
code is translated into a physical address by the data TLB, 
the built-in data cache is accessed by the physical address, 
an operand is fetched, and the operand and the logical 
address transferred as an F code are combined and outputted 
as a S code. 

In one F code an 8-byte boundary may be crossed, but an
operand fetch of eight bytes or less is specified. The F code
specifies whether or not access of the operand is to be
performed. Where the operand address itself and the imme-
diate value which have been calculated in the A stage 23 are
transferred to the E stage 27, no operand pre-fetch is
performed, and the content of the F code is transferred as an
S code.

Execution Stage 

The execution stage (E stage) 27 is operated with the E
code and the S code taken as input. The E stage 27 executes
an instruction, and the processings performed in the stages
up to the F stage 26 are all pre-processings for the E stage 27.
When a jump is executed or an EIT processing is started in the
E stage 27, processings in the IF stage 21 through the E stage
27 are all invalidated. The E stage 27 is controlled by a
microprogram, and an instruction is executed by executing
a series of micro-instructions from the entry address of a
microprogram routine contained in the R code.

The logical address sent to the address translater 54 in this
manner is first looked up in the TLB 55, and where the TLB
has hit and a physical address is generated, the physical
address is outputted to the data cache 56. Where the TLB 55
has missed, the address translater 54 outputs to the bus
interface unit 50 an external bus access request for referring
to an address translation table on the external memory.
Then, the reference to the address translation table is com-
pleted, and the generated physical address is outputted to
the data cache 56.
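
The hit and miss paths of the translation just described can be sketched as follows. This is an illustration only: the 4 KB page size, the dictionary-based TLB and the `page_table` mapping (standing in for the address translation table reached through the external bus) are assumptions not stated in the text.

```python
PAGE_BITS = 12  # assumed 4 KB pages; the text does not give a page size


def translate(logical, tlb, page_table):
    """Return (physical_address, tlb_hit) for one logical address."""
    vpn = logical >> PAGE_BITS
    offset = logical & ((1 << PAGE_BITS) - 1)
    hit = vpn in tlb
    if not hit:
        # TLB miss: stands in for the external bus access that refers to
        # the address translation table in external memory.
        tlb[vpn] = page_table[vpn]
    return (tlb[vpn] << PAGE_BITS) | offset, hit


page_table = {0x12345: 0x00ABC}   # hypothetical translation table contents
tlb = {}
first = translate(0x12345678, tlb, page_table)    # miss, table is consulted
second = translate(0x12345678, tlb, page_table)   # hit, served from the TLB
```

Both calls yield the same physical address; only the first one takes the table-walk path.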

When the data cache 56 has hit, the data read out from the
data cache 56 is registered into an S code register 60, acting
as an operand pre-fetch queue. The E stage 27 reads an
operand from the S code register 60, and performs process-
ing. When a cache miss has taken place, a registration
request for the block having missed is sent to the bus
interface unit 50, and the data fetched from the outside is
inputted to the data cache 56 and also to the S code
register 60.

The following describes a processing sequence in the case
where a store request is sent from the E stage 27. The E stage
27 writes store data to a DD register 58, writes a store
address to the AA register 59, and outputs an access request.
When the access request is received by the OAU 48, the
address translater 54, the TLB 55 and the data cache 56 are
operated like the processing sequence after receiving an
access from the above-mentioned address calculation stage
29. However, in the case of store processing, when the data
cache 56 has hit, the store data must be written to the data
cache 56. When the data cache 56 has missed, nothing is
performed.

The data cache 56 of the microprocessor uses a write-
through system, and the data is required to be written also to
the external memory. In general, the external bus cycle is
slower than the machine cycle of the microprocessor, and
therefore a storing buffer 57 is installed to prevent the
speed of internal processing from being limited by external
bus access. The storing buffer 57 has three entries, and even
when store data sent from the E stage 27 straddles a word
boundary, the data can be registered into the storing buffer
57 at one time.

FIG. 8 is a block diagram showing a configuration of the
data cache 56 of the present invention. This data cache 56 is
operated in a four-way set associative system of 0th to 3rd
ways, and the number of entries of each way is 64. The data
cache 56 is composed of a first address register 101 tempo-
rarily storing an address given through the address bus, a
second address register 102 temporarily storing part (six bits
of lower order) of the address, a transferring path 112
connecting the first address register 101 and the second
address register 102, a tag entry decoder 103 which decodes
six bits of lower order of the address and selects an entry of
a tag memory 104, the tag memory 104, a data entry decoder
105 selecting an entry of a data memory 106, the data
memory 106, a comparator 107 and a gate 108.

Where reading data from an external memory 110 is
performed as a result of processing an instruction, the
address of the data to be read is sent to the data cache 56
through an address bus 100, and the total bit width of the
address (32 bits) is stored in the first address register 101 for
the tag memory 104. The six lower order address bits are
stored in the second address register 102 for data. Then, the
six lower order address bits of the first and the second
address registers 101 and 102 are sent respectively to the tag
entry decoder 103 for the tag memory 104 and to the data
entry decoder 105 for the data memory 106, and a signal for
selecting any of 64 entries is generated.

Next, in the tag memory 104, four tags are read out from
the selected entry, and the comparator 107 compares the four
tags with higher-order bits excluding the above-mentioned
six lower order bits of the first address register 101. The
corresponding data read out from the data memory 106 is
selected by the gate 108, and the data is outputted to the
processing unit. Where data is read out from the data cache
56 in such a manner, the tag- and entry-decoders 103 and
105 perform identical operations.
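
The read path described above (entry selection by the six lower order bits, comparison of four tags, selection of the matching way) can be sketched in software. This is a behavioral illustration under stated assumptions, not the circuit of FIG. 8: the class and method names are hypothetical, and the tag is taken to be all address bits above the six index bits, as the text describes.

```python
INDEX_BITS = 6             # six lower order address bits select the entry
WAYS = 4                   # four-way set associative (0th to 3rd way)
ENTRIES = 1 << INDEX_BITS  # 64 entries per way


class DataCache:
    """Hypothetical behavioral model of the tag/data lookup on a read."""

    def __init__(self):
        self.tags = [[None] * WAYS for _ in range(ENTRIES)]
        self.data = [[None] * WAYS for _ in range(ENTRIES)]

    def fill(self, address, way, value):
        # Register a block in the given way (replacement policy not modeled).
        index = address & (ENTRIES - 1)
        self.tags[index][way] = address >> INDEX_BITS
        self.data[index][way] = value

    def read(self, address):
        # Both decoders see the same six bits on a read.
        index = address & (ENTRIES - 1)
        tag = address >> INDEX_BITS
        for way in range(WAYS):                  # comparator: four tags
            if self.tags[index][way] == tag:
                return True, self.data[index][way]   # gate selects the way
        return False, None


cache = DataCache()
cache.fill(0x1234, way=2, value="operand")
hit_result = cache.read(0x1234)    # same index, same tag
miss_result = cache.read(0x5234)   # same index, different tag
```

The two example addresses share the index `0x34` but differ in the higher-order bits, so only the first lookup matches a stored tag.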

The following provides a description of writing data to the 
external memory 110 as a result of processing an instruction. 
In the case of write processing, an address sent through the 
address bus 100 is stored only into the first address register 
101. Then, like read processing, retrieval of the tag memory 
104 is performed. When the data read out from the tag
memory 104 and the higher-order bits of the first address
register 101 coincide with each other, six bits of lower order
of the first address register 101 are transferred to the second
address register 102. If an address for write processing is
sent in the next request, the address is stored into the first
address register 101.

Next, in the data memory 106, the transferred lower-order
address is decoded by the data entry decoder 105, and a
signal for selecting any of 64 entries is generated. Then,
based on the way information of the tag memory 104, data
is written to the corresponding data memory 106 from an
internal data bus 111. On the other hand, the tag memory 104
performs a retrieval of an address corresponding to the next
write processing, and when they coincide with each other,
six lower order bits of the first address register 101 are
transferred to the second address register 102 like the above
case.

FIG. 9 shows a timing chart of the microprocessor having
the built-in data cache 56 of the present invention. FIG. 9
shows the case where a first operation is a read, and the
second to fourth operations are writes. First, the first opera-
tion is a read, and therefore the address thereof is written
simultaneously to the first address register 101 for the tag
memory 104 and the second address register 102 for the data
memory 106.

Then, a write processing comes next, and therefore the
address is written only to the first address register 101 for the
tag memory 104, and is not written to the second address
register 102 for the data memory 106. Then, when the
address of the write has coincided with the tag information
(cache hit), the address is transferred to the second address
register 102. At this time, when it is a cache miss, a write to
the data memory 106 of the data cache 56 is not required,
and therefore the address of the next read processing can
also be stored into the first address register 101 and the
second address register 102. The above processing enables
cache accesses to overlap, one per clock period, for
consecutive writes whether they hit or miss, and for a read
following a write that misses the cache.
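
The overlap rule stated above can be sketched as a small scheduling model. It is an assumption-laden illustration and does not reproduce FIG. 9: each tag lookup is taken to occupy one clock, a write that hits is taken to use the data memory in the following clock, and a read is taken to need the tag memory and the data memory in the same clock. The function name `schedule` is hypothetical.

```python
def schedule(ops):
    """Return the clock in which each operation performs its tag lookup.

    ops is a list of (kind, hit) pairs, kind being "read" or "write".
    """
    clocks = []
    clock = 0
    data_busy_at = -1   # clock in which a pending write occupies the data memory
    for kind, hit in ops:
        if kind == "read" and clock == data_busy_at:
            clock += 1  # the read must wait until the data memory is free
        clocks.append(clock)
        if kind == "write" and hit:
            data_busy_at = clock + 1   # data is written one clock after the tag hit
        clock += 1
    return clocks


read_then_writes = schedule([("read", True), ("write", True),
                             ("write", True), ("write", True)])
read_after_write_hit = schedule([("write", True), ("read", True)])
read_after_write_miss = schedule([("write", False), ("read", True)])
```

Under these assumptions, consecutive writes issue in back-to-back clocks, a read following a write miss also issues in the next clock, and only a read following a write hit is delayed by one clock.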

Also, the entry decoders 103 and 105 are provided inde-
pendently for the tag memory 104 and the data memory 106,
and therefore a bus snooping operation can also be per-
formed in parallel with a cache write. In a case where the
microprocessor is connected to another microprocessor on
an external bus, and accesses a common external memory
(main memory), coherency of the respective data caches 56
is required to be held, and in the case where the other
microprocessor updates the main memory, the block of the
data cache 56 holding that area must be invalidated. For this
reason, the first address register 101 for the tag memory 104
must have a function of inputting an address from the inside,
and monitoring and storing an external address. In this
microprocessor, in such a case, even if the first address
register 101 and the tag entry decoder 103 are used for bus

of tag words at a plurality of tag addresses, each of said
plurality of tag words being identical to said first portion of
one of said plurality of memory addresses of said main
memory, said cache memory having a first address register,
the method comprising:

coupling said first address register to a second address
register;

coupling said first address register to a first decoder, said
first decoder being coupled to said tag memory;

coupling said second address register to a second decoder,
said second decoder being coupled to said data
memory;

coupling said tag memory to a comparator;

receiving a first memory address in said first address
register from said address bus, said first memory
address being one of said plurality of said memory
addresses;

transmitting said first portion of said first memory address
to said comparator;

transmitting said second portion of said first memory
address to said first decoder;

transmitting said second portion of said first memory
address to said second address register;

decoding said second portion of said first memory address
to provide a tag address during said first clock period;

transmitting a first tag word stored in said tag memory at
said tag address to said comparator during said first
clock period;

comparing, in said comparator, said first tag word with
said first portion of said first memory address and
outputting a first indication indicating whether said first
tag word is identical with said first portion of said first
memory address;

decoding the contents of said second address register to
provide a first data cache address during said second
clock period;

storing data received on said data bus into said data
memory at said first data cache address during said
second clock period when said first indication indicates
that said first tag word is identical with said first portion
of said first memory address;

receiving a second memory address in said first address
register from said address bus during said second clock
period, said first memory address being one of said
plurality of said memory addresses;

transmitting said first portion of said second memory
address to said comparator;

transmitting said second portion of said second memory
address to said first decoder.

8. A cache memory apparatus comprising:

a data memory means for storing a plurality of data words, 
at least one of said data words being identical to a word 
stored in a main memory at a store address of said main 
memory; 

a tag memory means for storing tag words, at least one of 
said tag words containing at least portions of said store 
address; 

a first address register means for storing first information 
comprising said store address; 

a second address register means for storing second infor-
mation, said second information being at least part of
said store address;

path means for transferring said second information to
said second address register;

first decoder means for decoding said second information
and selecting an entry of said tag memory to provide a 
selected tag entry; 

second decoder means, different from said first decoder 
means, for decoding the contents of said second 
address register and selecting an entry of said data 
memory to provide a selected data entry; 

comparing means for comparing third information, com- 
prising tag information stored in said selected tag entry 
of said tag memory, with fourth information compris- 
ing that part of the store address which does not include 
the store address decoded by said first decoder means; 

gate means for permitting writing of information to said 
data memory at said selected data entry when said 
comparing means has detected that said third informa- 
tion matches said fourth information; and 

means for storing information, different from said first
information, in said first address register substantially
simultaneously with writing of information to said data
memory at said selected data entry.

9. A cache memory apparatus comprising: 

a data memory storing a plurality of data information 
words, at least one of said data information words being 
identical to a word stored in a main memory at a store 
address of said main memory, 

a tag memory different from said data memory storing tag 
information words containing at least portions of said 
store address, 

a first address register storing first information comprising 
said store address, said first address register stores a 
store address having a 32-bit length, 

a path for transferring second information to a second 
address register, wherein said second information con- 
tains at least a portion of said store address and includes 
six lower order bits of said 32-bit length of said store 
address, 

a first entry decoder, connected to said path for transfer- 
ring, which decodes said second information and 
selects an entry of said tag memory to define a selected 
entry, 

a second entry decoder, different from said first entry 
decoder, which decodes the contents of said second 
address register, and selects an entry of said data 
memory, and 

a comparator which compares third information compris-
ing tag information stored in the selected entry of said
tag memory with fourth information comprising that 
part of the store address which does not include the 
store address decoded by the first entry decoder, 
wherein when an instruction to write information to 
said main memory is executed, said first address reg- 
ister stores said store address, and when said comparing 
means has detected that said third information matches 
said fourth information, said second information is 
transferred to said second address register through said 
path. 

10. The method of claim 7 further comprising the steps of:

performing a bus snooping operation on said first address
register and said first decoder,

wherein said step of performing a bus snooping occurs
substantially simultaneously with said step of storing
data into said data memory.

* * * * *

ABSTRACT

An improved instruction tracing mechanism provides a
combination of hardware, internal to the CPU, and
novel software. Additional registers are added to or inter-
connected to the CPU. These registers store values
indicating the instruction address, data address, 
whether the instruction was a load or store, the number 
of bytes moved and whether any address mapping 
changes occurred. The registers are read by a trace 
interrupt handler which then provides the information 
to a trace buffer and a profile buffer. The end user can 
then access the trace and profile information through 
the input/output (I/O) system of the data processing 
system. 

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a data processing system having
a central processing unit including the trace facility of the
present invention;

FIG. 2 is a block diagram of the hardware aspect of
the present invention that is included in the central
processing unit; and

FIG. 3, consisting of 3A-3D, is a flow chart showing
the process steps implemented by the software aspect of
the present invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

Generally, tracing tools are used to evaluate program
applications running on a particular computer system.
In particular the instructions to be executed by the CPU
are analyzed for their content. Another important as-
pect of tracing is to determine where the operand data
to be manipulated by the instruction is stored. That is,
what is the address of the data to be operated on by the
instruction? This information will allow a system de-
signer to implement methods that will improve the
performance of the software. These methods include
improvements to the software itself, and improvements
to future hardware systems.

More specifically, a program to be evaluated is run in
conjunction with a testing tool, such as a tracing tool, or
the like. This tracing tool will output two basic types of
information. First, profile data that is essentially a set of
counts corresponding to the instructions executed by
the CPU. For example, the profile data may include the
number of instructions that were executed by a given
software routine. This information is then used by the
system designer to optimize the performance of the
software and/or hardware running the program. That
is, the designer will look at the number of instructions
executed per routine and try to minimize these execu-
tions to make the program run more efficiently. Addi-
tionally, the profile data will provide information re-
garding code coverage of the trace, i.e. how many lines
of code (LOC) executed during the test. This informa-
tion will tell the designer how reliable the trace test
was. The more LOC that were executed, the higher the
test coverage and the better the test.

Second, a trace tool will output trace information
which is time sequences of events that occurred on the
CPU. More particularly, the trace information includes
instructions that executed on the processor and when
these instructions executed. One common use of trace
information is in the area of hardware emulation. Often
computer hardware designers will create a software
model of a chip being designed. Before the design is
actually committed to fabrication, the trace informa-
tion can be used as input to the model and the output
monitored to determine how well a proposed hardware
design will run the program corresponding to the trace
input. This is a great advantage, since hardware design-
ers have the capability of knowing, before the chip(s)
are even fabricated, how well certain programs will
operate on the designed system.

Trace information is also used for performance de-
bugging to look at where the program stores and re-
trieves operands, when the CPU executed the instruc-
tions, and the like. This will allow the hardware design-
ers to understand how their system works when run-
ning a particular program. It also gives the software
designers more information about how their program
operates and lets them optimize its performance. To
improve both the robustness and performance of the
trace facility of the present invention, the following
hardware support has been provided in the CPU. The
central processing unit may be one of the PowerPC
microprocessors available from IBM Corp. (PowerPC
is a trademark of IBM Corp.). The CPU of the present
invention includes a single step interrupt capability that
vectors uniquely to a trace exception handler before the
execution of each next instruction. Additionally, special
registers, the Saved Instruction Address (SIA) and
Saved Data Address (SDA) are provided that will con-
tain the effective address of the last architecturally suc-
cessfully executed instruction and the instruction's op-
erand. A bit field that can be tested to determine if the
last architecturally successfully executed instruction
was a load or store is provided such that these bits are
used to validate the content of the SDA.
A bit field that can be tested to determine if the last
architecturally successfully executed instruction was a
load or store multiple is provided. These bits are used to
determine if the number of memory elements accessed
by the load or store was dynamically determined and
not easily obtained via post processing. Another bit
field is provided that can be tested to determine if the
last architecturally successfully executed instruction
altered the effective to virtual address map. This results
in considerable savings to the trace overhead since con-
ventional current software testing requires that ad-
dresses in machine code be correlated with source code
(i.e. the corresponding high level language statements
that generate the machine code). Further, a control
mechanism is provided that either allows or disallows
the assertion of a trace interrupt before execution of
each instruction.

The present invention utilizes a software trace algo-
rithm in conjunction with hardware additions to the
CPU to efficiently provide instruction profiles and
traces in conjunction with data address which the in-
structions operated on. When trace is enabled, a trace
exception interrupt is taken before each instruction
execution. Once the trace exception handler receives
control, a pointer will indicate the address of the next
instruction to be executed. As the last step of the prior
trace interrupt, the previous instruction address will
have been saved in a memory location (Last_Instruc-
tion_Address). Thus, the memory location will contain
the address of the last architecturally successfully exe-
cuted instruction. If the content of memory location
with the last data address represents a break in flow of
the traced instruction stream (e.g. branch etc) then the
contents of that memory location are written to the
trace medium (typically a buffer in memory).
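
The trace handler flow described in this section can be sketched as follows. This is a minimal software model, not the handler of the invention: the class, its attribute names and the record tuples are hypothetical, and a break in flow is detected here by assuming that sequential instructions are four bytes apart (PowerPC instructions are four bytes long), which is an assumption about how the test would be made.

```python
INSTRUCTION_BYTES = 4  # PowerPC instructions are four bytes long


class TraceHandler:
    """Hypothetical model of the trace exception handler."""

    def __init__(self):
        self.last_instruction_address = None   # models Last_Instruction_Address
        self.trace_buffer = []                 # models the trace medium

    def on_trace_interrupt(self, next_address, sda=None, was_load_or_store=False):
        last = self.last_instruction_address
        if last is not None:
            # Break in flow: the next instruction is not the sequential successor.
            if next_address != last + INSTRUCTION_BYTES:
                self.trace_buffer.append(("flow", last))
            # The load/store bit field validates the content of the SDA.
            if was_load_or_store:
                self.trace_buffer.append(("data", sda))
        # Save the current address in anticipation of the next invocation.
        self.last_instruction_address = next_address


handler = TraceHandler()
handler.on_trace_interrupt(0x100)
handler.on_trace_interrupt(0x104, sda=0x2000, was_load_or_store=True)
handler.on_trace_interrupt(0x200)   # instruction at 0x104 branched
records = handler.trace_buffer
```

Sequential instructions produce no flow record in this sketch, so only branches and operand addresses reach the buffer.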

If the last architecturally successfully executed in-
struction was a load or store then the address of the
operand will be contained in the SDA and will be writ-
ten to the trace medium. A bit field, as described above,
is then used to make the required determination as to
whether the instruction was a load or store. If the num-
ber of items accessed was dynamically determined, then
the actual number must be noted on the trace medium.
Another bit field, also as described above, is then used
to make the required determination as to whether the
instruction was a load multiple or a store multiple. If the
addressability (effective to virtual address mapping) has
been changed then the aspect that changed (e.g. BAT or
SEG registers) must be noted on the trace medium. The
bit field that determines if the executed instruction al-
tered the effective to virtual address map is used to
make the required determination. After utilizing the
content of the address of the last instruction, the content
of that memory location is replaced by the content of
the current instruction address in anticipation of the
next trace handler invocation. Thus, sequential instruc-
tion streams are traced.

There are two major performance improvements.
The first is that the effort expended in computing the
last instruction's operand address and length has been
reduced to testing bit fields (to determine if the last
instruction executed was a load or store, or a load or
store multiple, respectively). These bit fields along with
the SDA register eliminate the need for instrumentation
of the loads and stores in the traced code.

The second performance improvement is that detect-
ing effective to virtual address mapping changes is very
efficient, being reduced to testing the bit field that de-
termines if the effective to virtual address mapping has
been altered. Since the overall rate of addressability
changes is typically low, knowing when addressability
has not changed allows addressability searches to be
performed only when required thus benefiting perfor-
mance.

A significant improvement in the robustness of the
software tracing aspect of the present invention comes
about as consequence of the simplifications afforded the
tracing software by the ability to determine when the
effective to virtual addressing has changed. Also, the
addition of the ability to efficiently determine the oper-
and addresses and lengths makes the mechanism which
eliminates the need to manage the tracing of various
system components more operable.

The current estimate for the trace handler of the
present invention places the path length at 15 instruc-
tions when there is no write to the trace medium, e.g.
trace buffer, and 20 instructions when there is a write to
the trace medium. Assuming a basic block length of 5
instructions, the percentage of load and stores in the
traced instructions stream to be 40% and the percentage
of taken branches to be 5%, leads to a mean interrupt

when software tracing, the trace code will be in the
cache, thus the trace handler CPI may be less than 0.8
and consequently execute in less than 24.8 cycles. Also,
since the trace interrupt is a special case, the present
invention can take into account the special conditions
surrounding a trace interrupt. Consequently, it is rea-
sonable to believe that it is possible to obtain very at-
tractive trace execution rates.

A performance contrast exists between the present
invention, which utilizes both a novel hardware scheme
and a new software portion, to that of the typical cache
inhibited external hardware trace scheme. In machines
that depend on low cache miss rates, the cache inhibited
scheme may result in a severe penalty, i.e. the more
code parallelism a processor can exploit, the more criti-
cal the cache miss ratio will be to the processor's perfor-
mance.

Assuming the following simple linear model of execu-
tion:

\[ CPI(\mathrm{Miss\_Ratio}) = CPI_{\mathrm{infinity}} + \mathrm{Miss\_Ratio} \times \mathrm{Miss\_Cost} \]

The degree to which the processor is slowed down by
hardware tracing that disables the cache is:

\[ \mathrm{Trace\_Slow\_Down} = \frac{CPI(\mathrm{Miss\_Ratio} = 1)}{CPI(\mathrm{Normal\_Miss\_Ratio})} \]

Applying the linear model to the last equation gives the
following:

\[ \mathrm{Trace\_Slow\_Down} = \frac{1 + \dfrac{\mathrm{Miss\_Cost}}{CPI_{\mathrm{infinity}}}}{1 + \mathrm{Normal\_Miss\_Ratio} \times \dfrac{\mathrm{Miss\_Cost}}{CPI_{\mathrm{infinity}}}} \]

Assuming that the code being traced has an infinite
cache CPI of 1.6, the cost of a miss is 16 cycles and the
normal miss ratio is 1% then the trace slow down is
about 10. On the other hand, if the code being traced
has an infinite cache CPI of 0.25, the cost of a miss is 16
cycles and the normal miss ratio is 0.5% then the trace
slow down is about 49.1.

This last value corresponds to an effective workload
CPI while hardware tracing of 12.3 cycles per instruc-
tion. Assuming that a processor capable of sustaining a
workload CPI of 0.25 when not tracing can also pro-
vide a trace handler CPI of 0.5, the effective workload
CPI while software tracing would be 19.5 cycles per
instruction or about 58% slower than hardware tracing.
The above computations are examples intended to
show that in a real processor the slow down imposed by
external hardware tracing can actually turn out to be
comparable to that of software tracing, in other words,
there is no real advantage of external hardware tracing
handler path length of 17.25 instructions. over software tracing. Thus, the present inventk>n be- 

Assuming an internq)t latency of 10 cycles and an comes very attractive when compared to the prior art 
interrupt Cycles Per Instructicm (CPI) figure of 0.8 the systems. This is particularly true for machines that ex- 
mean number of cycles added to the execution time of 60 ploit code paraHdism and multiple storage hierarchies, 
each instruction of the workload will be about 23.8 Other limitations such as dectrical loading of high 

speed signals by trace equipment could force the traced 
system to be operated at a reduced clock frequency 
resulting in a further decrease of the speed of hardware 
65 tracing. Also, most RISC processors have external bus- 
ses that run at a fraction of the speed of the processor 
which accentuates the bandwidth requirements of hard- 
ware tradng. Such problems can only worsen as riLa- 



cydes. Since most processor instructions contemplated 
by the present invention can execute in one cycle, the 
effective CPI for the traced workload would be ^>out 
24.8 cycles p^ instruction. 

Note that there is no need to disable the caches in 
software tracing. The least recently used (LRU) re- 
placement strategy of the caches essentially insures that 
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clme clock speeds become higher and packaging be- 
comes more dense. It is inevitable that haidware tracing 
will continue to become increasingly less tractable and 
software tracing increasingly more tractable. 
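The cost estimates above can be reproduced with a few lines of arithmetic. The following is a minimal Python sketch, not part of the disclosure: the function names are hypothetical, a disabled cache is modeled as a miss ratio of 1, and the fraction of handler invocations that write to the trace medium is assumed to be the load/store share plus the taken-branch share (0.40 + 0.05), which is one reading that reproduces the 17.25 figure.

```python
def cpi(miss_ratio, cpi_infinity, miss_cost):
    """Linear model: CPI(Miss Ratio) = CPI_infinity + Miss Ratio * Miss Cost."""
    return cpi_infinity + miss_ratio * miss_cost


def trace_slow_down(normal_miss_ratio, cpi_infinity, miss_cost):
    """Slow down of cache-inhibited hardware tracing.

    Assumption: with the cache disabled every access misses (miss ratio 1).
    """
    return cpi(1.0, cpi_infinity, miss_cost) / cpi(
        normal_miss_ratio, cpi_infinity, miss_cost
    )


def mean_handler_path_length(no_write_len, write_len, write_fraction):
    """Mean trace handler path length, weighted by how often it writes."""
    return (1.0 - write_fraction) * no_write_len + write_fraction * write_len


def software_trace_cpi(path_length, interrupt_latency, handler_cpi, workload_cpi):
    """Effective CPI of the traced workload under software tracing."""
    added_cycles = interrupt_latency + path_length * handler_cpi
    return added_cycles + workload_cpi


if __name__ == "__main__":
    # Hardware tracing, first set of figures from the text.
    print(trace_slow_down(0.01, 1.6, 16))
    # Software tracing: 15 / 20 instruction paths, assumed 45% write fraction.
    path = mean_handler_path_length(15, 20, 0.40 + 0.05)
    print(path, software_trace_cpi(path, 10, 0.8, 1.0))
```

With the first set of figures the slow down evaluates to 10, and the software tracing numbers come out at 17.25 instructions and 24.8 cycles per instruction, matching the text. With the second set of figures (infinite cache CPI of 0.25, 0.5% miss ratio) the same formula gives a value near 49, close to the figure quoted.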



Further, the present invention includes a register denoted the "Saved Instruction Address" or SIA. The SDA is an important aspect of the present invention since it will eliminate unnecessary instructions in the trace handler. As described above, reducing path length reflects directly in the performance of the trace.

With the disclosed hardware and software features of the present invention, useful traces can be inexpensively captured with reasonable performance degradation from a standard machine (i.e., a machine not dedicated to tracing). A distinct advantage of this is that stand alone data processing systems including the processor of the present invention can be used to trace instructions, without having either an external hardware tracing device, or special purpose tracing software, thus providing more utility to the end user.

Referring to FIG. 1, a typical data processing system is shown which may be used in conjunction with the present invention. A central processing unit (CPU), such as the PowerPC 604 microprocessor (PowerPC is a trademark of IBM) commercially available from IBM, is provided and interconnected to the various other components by system bus 12. It should be noted that the hardware trace mechanism of the present invention is part of the function which is included in CPU 10.

Read only memory (ROM) 16 is connected to CPU 10 via bus 12 and includes the basic input/output system (BIOS) that controls the basic computer functions. Random access memory (RAM) 14, I/O adapter 18 and communications adapter 34 are also interconnected to system bus 12. Expanded memory 15 is additional RAM added to the data processing system and is also shown interconnected to bus 12. I/O adapter 18 may be a small computer system interface (SCSI) adapter that communicates with a disk storage device 20. Communications adapter 34 interconnects bus 12 with an outside network enabling the data processing system to communicate with other such systems. Input/Output devices are also connected to system bus 12 via user interface adapter 22 and display adapter 36. Keyboard 24, trackball 32, mouse 26 and speaker 28 are all interconnected to bus 12 via user interface adapter 22. Display monitor 38 is connected to system bus 12 by display adapter 36. In this manner, a user is capable of inputting to the system through the keyboard 24, trackball 32 or mouse 26 and receiving output from the system via speaker 28 and display 38. Additionally, an operating system such as DOS or the OS/2 system (OS/2 is a Trademark of IBM Corporation) is used to coordinate the functions of the various components shown in FIG. 1.

Referring to FIG. 2, a CPU is shown including the tracing hardware of the present invention. Reference 10 refers generally to a CPU as is shown in FIG. 1. A computer software program 100 is installed and running in conjunction with an operating system 101, which may be one of AIX, DOS, OS/2, or the like (AIX and OS/2 are trademarks of IBM Corp.). The operating system 101 controls the basic functions of the computer system. In a preferred embodiment of the present invention, CPU 10 will be one of the IBM reduced instruction set computer (RISC) processing systems as implemented in the POWER and PowerPC Architecture (POWER and PowerPC are trademarks of IBM Corp.). Execution unit 103 will include both a fixed point unit (FXU) and a floating point unit (FPU). The FXU is used primarily for integer processing, while the FPU is normally used for processing numbers in scientific notation. Both of these execution units receive instructions from an instruction cache unit. The FPU and FXU output information to a bus 104 which is connected to a plurality of register files. Register 105 will contain the instruction address, and register 107 will store the data address (the address from which the data is retrieved during a load operation, and the address to which the data is being stored for a store operation). The Load/Store register 109 will contain a binary value that will indicate if either a load or store instruction was executed by any of the execution units in CPU 10. The Number_of_Bytes_Moved register 111 will include a value representative of the number of bytes of data moved to the memory location during a load or store instruction (if a load or store instruction was actually executed, as indicated by the value in register 109). It should be noted that the term "memory location" as used herein will include all of the components of a hierarchical memory subsystem, including a level one (L1) cache, level two (L2) cache, instruction cache unit, and the like.

Register 113 will contain a specific binary value indicating whether or not the effective to virtual address mapping scheme was changed by the previous instruction. This information is very important in a tracing context, since it may be necessary to use the physical address (data or instruction) to recreate the actual instruction that was executed, or data that was accessed. A mapping table, or look-up table, is used to map a 32 bit effective address into a 52 bit virtual address. The actual physical address of the data or instruction is then determined from the virtual address. Periodically, instructions alter, or change, the content of the look-up table such that a particular effective address maps to a different virtual address. It is imperative that the tracing tool log the changes to the table when they occur. Otherwise, the actual data or instruction at a specified address may not be obtained by working backwards from the physical address to the virtual address to the effective address.

Trace interrupt handler 115 is a routine contained in the software operating system. The process implemented by this program will be more fully described below in conjunction with the flowchart of FIG. 3. Basically, the trace interrupt handler will read the addresses in registers 105 and 107, along with the values in registers 109, 111, 113 and determine whether this information relates to tracing or profiling of the instructions corresponding to program 100 that is being run. For profile information, the appropriate register contents are placed in a profile buffer 117. Trace information is then stored in trace buffer 119. These buffers 117 and 119 are then accessible by a user by the normal system I/O, i.e. the results can be viewed on display 38, or stored to disk 20, or the like.

FIGS. 3A-3D are a flowchart illustrating the operation of the trace interrupt handler software of the present invention. This software is part of the operating system; in a preferred embodiment the AIX operating system is contemplated to be used by the present invention. Therefore, interrupt handler 115 is included as part of the AIX operating system.

At step 1 the interrupt handler program is invoked and the process identifies registers 105, 107, 109, 111, 113 at step 2. It is then determined at step 3 whether the current instruction address in register 105 is the same as the last, or previous, instruction address plus one. If not, then the routine proceeds to step 4, since the instructions are not sequential. This means that a branch or system interrupt, or the like, has occurred. Step 5 then determines whether the instruction address information is to be used for tracing or profile. If the information is to be used for profiling, then the count for the last instruction address is updated in the profile buffer 117 (step 6). However, for tracing the last instruction address itself is provided to trace buffer 119 (step 7). Subsequent to both steps 6 and 7, the process continues to step 8. Also, if it was determined that the instructions were sequential, at step 3, the method proceeds to step 8. It is then determined if the instruction was either a load or a store (step 8). If so, then step 9 determines whether the information is to be used for profile or trace. If the instruction was neither a load nor store, the routine skips to step 14 (discussed below). If the load/store determination information is to be used as profile data, then the count information corresponding to the data address, for data being operated on by the instruction, is logged in profile buffer 117, at step 13.

However, for trace information, the actual address of the memory location of the data that was operated on by the instruction is logged in the trace buffer 119 (step 10). Subsequent to step 10, it is then determined if the number of bytes moved is equal to 0 (step 11). If not, then the number of bytes moved by the load or store operation is placed in trace buffer 119 (step 12). If the number of bytes moved is equal to 0, i.e. no bytes were moved, then the process continues to step 14. At step 14, it is determined if an address mapping change has occurred. More specifically, if register 113 contains a binary "1" then an address mapping change has occurred, i.e. the last instruction changed the effective to virtual address mapping. If a logical "0" is present in register 113, then the last instruction did not change address mapping. As noted above, an address mapping change is an alteration to the look-up table used to translate an effective address to a virtual address. If it is determined that an address mapping change has occurred, then the new mapping (relationship of effective address to virtual address) is stored in trace buffer 119 (step 15).

If it was determined that the address mapping did not change, and subsequent to step 15, the current instruction address is saved, at step 16, for comparison purposes. Step 17 sets the last instruction address equal to the current instruction address, effectively incrementing the current instruction address to make it the "new" last instruction address. Step 18 then determines if there are more instructions to process. If so, the procedure returns to step 3. If not, the process continues to step 19 and ends.
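The reason the mapping change must be logged at the point where it occurs can be seen from the side of a tool that reads the trace afterwards. The following Python sketch is illustrative only and rests on assumptions not stated in the text: trace records are modeled as hypothetical `(kind, payload)` tuples, and the look-up table is modeled as 16 segment entries selected by the top 4 bits of a 32 bit effective address, each supplying a 24 bit identifier that is joined to the low 28 bits to form a 52 bit virtual address.

```python
SEGMENT_SHIFT = 28                      # assumed: low 28 bits are the in-segment offset
OFFSET_MASK = (1 << SEGMENT_SHIFT) - 1  # mask selecting that offset


def to_virtual(effective_address, segment_table):
    """Translate a 32 bit effective address using the current table state."""
    segment = (effective_address >> SEGMENT_SHIFT) & 0xF
    return (segment_table[segment] << SEGMENT_SHIFT) | (effective_address & OFFSET_MASK)


def replay(trace_records, initial_segment_table):
    """Resolve traced data addresses, applying mapping changes in trace order."""
    table = dict(initial_segment_table)  # copy so the caller's table is untouched
    resolved = []
    for kind, payload in trace_records:
        if kind == "mapping":
            # Hypothetical record layout: (segment number, new segment identifier).
            segment, segment_id = payload
            table[segment] = segment_id
        elif kind == "data":
            resolved.append(to_virtual(payload, table))
    return resolved


if __name__ == "__main__":
    records = [
        ("data", 0x10000010),
        ("mapping", (1, 0xABCDEF)),
        ("data", 0x10000010),
    ]
    for address in replay(records, {1: 0x000123}):
        print(hex(address))
```

In this example the same effective address resolves to two different virtual addresses because a mapping record sits between the two accesses. If that record were missing from the trace, the second access would be attributed to the wrong location.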

Following is the pseudocode for the trace interrupt software of the present invention. This pseudocode describes the activity that would take place in the trace interrupt handler, when a trace interrupt is incurred. Note that the pseudocode assumes the existence of five registers, including the instruction address and data address registers. These registers are illustrated in FIG. 2, which is a block diagram of the hardware aspect of the present invention. Along with the instruction and data address registers, a Load_Store register is provided that contains a 1 if the last instruction was a load or store, otherwise it contains 0. The Number_of_Bytes_Moved register contains the number of bytes moved by the last instruction, if the last instruction was a load or store whose operand size is only known at run-time, otherwise it contains 0. The problem here is that standard RISC systems have load and store instructions whose operand size (i.e. number of bytes moved) is given by a register value at run-time. The Address_Mapping_Change register contains a 1 if the last instruction changed the effective-to-virtual address mapping, otherwise, it contains 0.



Pseudocode

Registers:
    Instruction_Address
    Data_Address
    Load_Store
    Number_of_Bytes_Moved
    Address_Mapping_Change

Variables:
    Last_Instruction_Address (Address of the previous, or last instruction executed)

Code:
    if (Instruction_Address) does not equal (Last_Instruction_Address + 1)
    /* instructions were not sequential, branch or interrupt took place */
    {
        if (Tracing)
            /* Record non-sequentiality */
            Place Last_Instruction_Address in trace buffer
        if (Profiling)
            Update Count for Last_Instruction_Address
    }
    if (Load_Store equals 1)
    {
        if (Tracing)
        {
            Place Data_Address in trace buffer
            if (Number_of_Bytes_Moved does not equal 0)
                Place Number_of_Bytes_Moved in trace buffer
        }
        if (Profiling)
            Update Count for Data_Address
    }
    if (Address_Mapping_Change equals 1)
        Place new address mapping in trace buffer
    /* save away Instruction_Address for comparison next time through */
    Last_Instruction_Address = Instruction_Address
    Resume execution at instruction at Instruction_Address
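The pseudocode above can be rendered as a small runnable model. The following Python sketch is an illustration under stated assumptions rather than the handler itself: the class names, the tuple layout of trace records and the `new_mapping` argument are hypothetical, addresses are counted in units of one instruction so that "plus one" means the next sequential instruction, and the five hardware registers are passed in as a plain data object.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TraceRegisters:
    """Snapshot of the five registers read on each trace interrupt."""
    instruction_address: int
    data_address: int = 0
    load_store: int = 0              # 1 if the last instruction was a load or store
    number_of_bytes_moved: int = 0   # nonzero only when the size is known at run-time
    address_mapping_change: int = 0  # 1 if the last instruction changed the mapping


@dataclass
class TraceHandler:
    tracing: bool = True
    profiling: bool = False
    last_instruction_address: int = 0  # assumed starting point of the traced code
    trace_buffer: list = field(default_factory=list)
    profile_counts: Counter = field(default_factory=Counter)

    def handle(self, regs, new_mapping=None):
        # Non-sequential flow: a branch or interrupt took place.
        if regs.instruction_address != self.last_instruction_address + 1:
            if self.tracing:
                self.trace_buffer.append(("branch_from", self.last_instruction_address))
            if self.profiling:
                self.profile_counts[("insn", self.last_instruction_address)] += 1
        # Loads and stores: record the data address and, if known, the size.
        if regs.load_store == 1:
            if self.tracing:
                self.trace_buffer.append(("data", regs.data_address))
                if regs.number_of_bytes_moved != 0:
                    self.trace_buffer.append(("bytes", regs.number_of_bytes_moved))
            if self.profiling:
                self.profile_counts[("data", regs.data_address)] += 1
        # Effective-to-virtual mapping changes are logged when they occur.
        if regs.address_mapping_change == 1:
            self.trace_buffer.append(("mapping", new_mapping))
        # Save the address for the comparison on the next invocation.
        self.last_instruction_address = regs.instruction_address


if __name__ == "__main__":
    handler = TraceHandler(last_instruction_address=100)
    handler.handle(TraceRegisters(101))                                  # sequential
    handler.handle(TraceRegisters(102, data_address=0x2000, load_store=1))
    handler.handle(TraceRegisters(200))                                  # branch taken
    print(handler.trace_buffer)
```

Only the non-sequential transfer and the data access produce trace records here, which is the point of comparing against the saved last instruction address: straight-line code costs the handler a comparison and nothing more.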



Although certain preferred embodiments have been
shown and described, it should be understood that many
changes and modifications can be made therein without
departing from the scope of the appended claims.

We claim: 

1. A method of recording trace information relating
to the execution of instructions on a processor unit,
comprising the steps of:

storing instruction information output from at least
one execution unit in said processing unit;

determining from said instruction information
whether said instructions access data from a mem-
ory subsystem; and

determining from said instruction information, stored
in said processor unit, a data address at which said
data, operated on by said instructions, was ac-
cessed.

2. A method according to claim 1 wherein said step of 
determining whether said instructions access data, com- 
prises the step of determining a quantity of data ac- 
cessed from said memory location, and for storing said 
quantity as trace information. 
ABSTRACT

An arithmetic logic unit (230) may be divided into a plurality of independent sections (301, 302, 303, 304). A bit zero or carry status signal corresponding to each section is stored in a flags register (211), which preferably includes more bits than the maximum number of sections of the arithmetic logic unit (230). New status signals may overwrite the previous status signals or rotate the stored bits and store the new status signals. A status register (210) stores a size indicator that determines the number of sections of the arithmetic logic unit (230). A status detector has a zero detector (321, 322, 323, 324) for each elementary section (301, 302, 303, 304) of the arithmetic logic unit (230). When there are fewer than the maximum number of sections, these zero signals are ANDed (331, 332, 341). A multiplexer couples the carry-out of an elementary section (311, 312, 313, 314) to the carry-in of an adjacent elementary section (301, 302, 303, 304) or not depending on the selected number of sections. The status detector supplies carry outs from each elementary section (301, 302, 303, 304) not coupled to an adjacent elementary section (301, 302, 303, 304) to the flags register (211). Status signals stored in the flags register (211) influence the combination of inputs formed by the arithmetic logic unit (230) within corresponding sections. An expand circuit (238) expands selected bits of flags register (211) to form a third input to a three input arithmetic logic unit (230).
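The sectioned operation summarized in the abstract can be modeled behaviorally. The following Python sketch is only an illustration: the function name and the 32 bit word width are assumptions, and it models addition alone, returning one carry-out and one zero flag per section rather than describing the multiplexer and detector circuits themselves.

```python
def sectioned_add(a, b, sections, width=32):
    """Add two `width`-bit words as `sections` independent lanes.

    Returns (result, carries, zeros), with lane 0 the least significant.
    No carry crosses a lane boundary, which models the carry-out of one
    section not being coupled to the carry-in of the next.
    """
    if sections <= 0 or width % sections != 0:
        raise ValueError("width must be an exact multiple of sections")
    lane_bits = width // sections
    lane_mask = (1 << lane_bits) - 1
    result = 0
    carries = []
    zeros = []
    for lane in range(sections):
        shift = lane * lane_bits
        total = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        lane_sum = total & lane_mask
        result |= lane_sum << shift
        carries.append(total >> lane_bits)   # carry-out of this lane (0 or 1)
        zeros.append(int(lane_sum == 0))     # zero flag for this lane
    return result, carries, zeros


if __name__ == "__main__":
    # Four 8 bit lanes versus one 32 bit lane on the same operands.
    print(sectioned_add(0x00FF00FF, 0x00010001, 4))
    print(sectioned_add(0x00FF00FF, 0x00010001, 1))
```

Running the same operands through four lanes and through one lane shows the effect of breaking the carry chain: in the four-lane case the carries out of the low bytes are reported as status instead of propagating into the neighboring bytes.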

77 Claims, 37 Drawing Sheets 